Stabilize benchmark signal with variance-aware sampling and rolling baselines #4622
Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>
No changes needing a change description found.
@copilot fix format, spellcheck and run benchmark ci
Addressed in commit
@copilot the run benchmark workflow is still failing
Fixed in commit Root cause was Node engine mismatch in the benchmark workflow (
@copilot benchmark runs into OOM
... Fixed in I updated the benchmark workflow to run the benchmark CLI with a larger Node heap (
@copilot still OOM. Increasing memory doesn't seem to be the right solution anyway. What is causing this increase in memory use? Is there a memory leak in an emitter or in the benchmark code?
Root cause is memory accumulation in the benchmark process, not a single emitter regression: repeated in-process
Benchmark results were overly sensitive to GitHub runner noise, causing high run-to-run deviation and unreliable PR deltas. This change makes benchmark comparisons more stable by increasing sample quality, gating on variance, and comparing against a rolling mainline baseline instead of a single latest run.
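As a rough illustration of the two ideas in this summary (gating on variance, and comparing against a rolling baseline), here is a minimal TypeScript sketch. It is not the code from this PR. The function names, the use of the coefficient of variation (sample standard deviation divided by mean) as the noise measure, the median as the baseline aggregate, and the interpretation of `--noise-cv-threshold`, `--max-reruns`, `--rerun-iterations`, and `--baseline-window` are all assumptions made for the example.

```typescript
// Hypothetical sketch: names and aggregation choices are assumptions, not this PR's implementation.

/** Arithmetic mean of a non-empty sample. */
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

/** Coefficient of variation: sample standard deviation divided by the mean. */
function coefficientOfVariation(xs: number[]): number {
  if (xs.length < 2) return 0;
  const m = mean(xs);
  if (m === 0) return 0;
  const variance = xs.reduce((acc, x) => acc + (x - m) ** 2, 0) / (xs.length - 1);
  return Math.sqrt(variance) / m;
}

/** Median of a non-empty sample (does not mutate the input). */
function median(xs: number[]): number {
  const sorted = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

interface NoiseGateOptions {
  noiseCvThreshold: number; // assumed meaning of --noise-cv-threshold
  maxReruns: number; // assumed meaning of --max-reruns
  rerunIterations: number; // assumed meaning of --rerun-iterations
}

/**
 * Takes `iterations` samples, then keeps adding `rerunIterations` more samples
 * while the coefficient of variation stays above the threshold, up to `maxReruns` times.
 */
function sampleWithNoiseGate(
  runOnce: () => number,
  iterations: number,
  opts: NoiseGateOptions,
): { samples: number[]; cv: number; reruns: number } {
  const samples: number[] = [];
  for (let i = 0; i < iterations; i++) samples.push(runOnce());
  let reruns = 0;
  while (coefficientOfVariation(samples) > opts.noiseCvThreshold && reruns < opts.maxReruns) {
    for (let i = 0; i < opts.rerunIterations; i++) samples.push(runOnce());
    reruns++;
  }
  return { samples, cv: coefficientOfVariation(samples), reruns };
}

/**
 * Baseline = median of the last `window` main-branch results (assumed meaning of
 * --baseline-window). Falls back to the single latest result when history is empty.
 */
function rollingBaseline(history: number[], window: number, latest?: number): number | undefined {
  const recent = window > 0 ? history.slice(-window) : [];
  return recent.length > 0 ? median(recent) : latest;
}
```

In this sketch a median is used for the baseline because a single slow runner in the window shifts it less than it would shift a mean; whether the PR aggregates history this way is not stated in the description.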
Sampling and runner stability
- warmup=3, iterations=25).

Variance-aware execution (noise gate)
- --noise-cv-threshold
- --max-reruns
- --rerun-iterations

Rolling baseline for PR comparisons
- main history (results/history.json) with fallback to results/latest.json.
- --baseline-window to control rolling window size.

Benchmark output and docs updates